Lab 04b: Extracting text features

Introduction

This lab demonstrates feature extraction with text data. At the end of the lab, you should be able to use pandas and scikit-learn to:

  • Extract TF-IDF features from text data.

Getting started

Let's start by importing the packages we'll need. As usual, we'll import pandas for exploratory analysis, but this week we're also going to use scikit-learn (sklearn), a modelling and machine learning library for Python.


In [ ]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

Next, let's load the data. Write the path to your sms.csv file in the cell below:


In [ ]:
data_file = 'data/sms.csv'

Execute the cell below to load the CSV data into a pandas data frame with the columns label and message.

Note: This week, the CSV file is not comma separated, but instead tab separated. We can tell pandas about the different format using the sep argument, as shown in the cell below. For more information, see the read_csv documentation.


In [ ]:
sms = pd.read_csv(data_file, sep='\t', header=None, names=['label', 'message'])
sms.head()

Extracting text features

As can be seen, our data is in the form of raw text. To make it work with machine learning algorithms, we'll need to transform the data into a numerical representation. One popular way to do this with text data is to compute term frequency (TF) and inverse document frequency (IDF) measures:

  • Term frequency is a measure of how often a given term appears in a given document, e.g. how often the word "free" appears in a given SMS message. The more often a word appears in a document, the higher its term frequency.
  • Inverse document frequency is a measure of how rare a word is in a set of documents, e.g. the word "the" appears commonly in many SMS messages and so its presence (or absence) provides little information when it comes to distinguishing spam from ham. The higher the inverse document frequency of a word, the rarer it is (and, therefore, the more distinguishing power it has).

Typically, term frequency and inverse document frequency are combined into a single metric, term frequency-inverse document frequency (TF-IDF), which is simply the product of the two values. Consequently, a term has a high TF-IDF score when it appears rarely across the set of documents (e.g. SMS messages) but often within a given document (e.g. a candidate SMS message under evaluation). A low TF-IDF score indicates that the term appears infrequently in the given document, occurs frequently across the set of documents, or both. We can exploit this information to find terms that distinguish a certain set of documents (e.g. spam) from a larger set of documents (more on this in later labs!).
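To make the idea concrete, here is a minimal sketch of the textbook TF and IDF formulas applied to a tiny, made-up corpus (not the SMS data). Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and normalises each row, so its exact numbers will differ, but the intuition is the same:

```python
import math

# A tiny illustrative corpus of three "documents" (hypothetical, not sms.csv).
docs = [
    'free prize call now',
    'call me when you are free',
    'see you at lunch',
]

def tf(term, doc):
    # Term frequency: how often the term appears in this document,
    # relative to the document's length.
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # Simplified textbook IDF; scikit-learn uses a smoothed variant.
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df)

# "free" appears in two of the three documents, "prize" in only one,
# so "prize" gets the higher IDF: it is rarer, hence more distinguishing.
print(idf('free', docs))
print(idf('prize', docs))

# TF-IDF is simply the product of the two values.
print(tf('free', docs[0]) * idf('free', docs))
```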

We can compute the TF-IDF score for each word in each message using the TfidfVectorizer class from scikit-learn:


In [ ]:
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(sms['message'])

The resulting matrix has the same number of rows as the input SMS data, but it has thousands of columns - each one corresponding to a new feature:


In [ ]:
matrix.shape

This might seem a bit confusing at first, but it makes sense when you think about it: the rows of the matrix correspond to our original messages, the columns correspond to the words that appear in those messages, and the value in each cell is the TF-IDF score of a word in a message. As most words do not appear in most messages, the vast majority of the values are zero - such a matrix is known as a sparse matrix, and scikit-learn stores only the non-zero entries to save memory.
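We can see the sparsity directly: fit_transform returns a SciPy sparse matrix, and its nnz attribute counts the stored non-zero entries. The sketch below uses a small stand-in corpus rather than the SMS data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A small illustrative corpus (a stand-in for the SMS messages).
docs = ['free prize call now', 'call me later', 'are you free for lunch']

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# Only the non-zero entries are stored, which is what makes a matrix
# with thousands of columns feasible in memory.
print(type(matrix))
print('stored non-zero entries:', matrix.nnz)
print('total cells:', matrix.shape[0] * matrix.shape[1])
```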

We can take a look at the corresponding word feature indices via the vocabulary_ attribute of TfidfVectorizer:


In [ ]:
tfidf.vocabulary_

As can be seen below, the vocabulary has the same number of items as there are columns in the matrix:


In [ ]:
len(tfidf.vocabulary_)

Finally, we can examine the TF-IDF score for any combination of message and word by checking the corresponding entry in the matrix. For instance, to see the TF-IDF score for the word "only" in the first message in our data frame, we can write:


In [ ]:
row = 0
col = tfidf.vocabulary_['only']

print('Message: "%s"' % sms.loc[row, 'message'])
print('TF-IDF score: %f' % matrix[row, col])

If a word isn't in a message, its TF-IDF score will be zero:


In [ ]:
row = 1
col = tfidf.vocabulary_['only']

print('Message: "%s"' % sms.loc[row, 'message'])
print('TF-IDF score: %f' % matrix[row, col])

In a future lab, we'll use TF-IDF features to build a spam classifier.